In this markdown, I look at ride-sharing data published by the City of Chicago, IL for the first week of November 2018. The entire dataset, which runs through May 2019, is approximately 6.5 GB. In the interest of time and limited processing power, I decided to look at only a small segment of the dataset, which is itself not insignificant.
Chicago (brown) is located in Cook County, IL as shown in the map below.
The data provided by ride-sharing companies to the City of Chicago were anonymized in the following ways:
1. Drop-off and pick-up locations correspond to US Census Bureau TIGER Census Tracts.
2. Pick-up and drop-off times were rounded to the nearest 15-minute interval.
3. Fares and tips were rounded to the nearest $2.50 and $1.00 respectively.
4. Names of the companies were hidden.
This was done to protect the identity of riders. In about a third of all rides in November 2018, the rider(s) could still be identified even after times were rounded to the nearest 15-minute interval and locations were aggregated to the Census Tract. In these cases, the Census Tract fields were left blank. These observations were excluded from the analyses that follow.
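The rounding rules and the exclusion of blank-tract rows can be sketched as follows. This is a minimal illustration in Python, not the code used for this analysis; the field names, tract labels, and trips are hypothetical, and rounding exact ties upward is an assumption, since the source does not say how ties were handled.

```python
import math
from datetime import datetime, timedelta


def round_to_step(value, step):
    """Round value to the nearest multiple of step; exact ties round up."""
    return math.floor(value / step + 0.5) * step


def round_to_quarter_hour(ts):
    """Round a datetime to the nearest 15-minute mark."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds = (ts - midnight).total_seconds()
    return midnight + timedelta(seconds=round_to_step(seconds, 900))


# Hypothetical raw trips; the field names and tract labels are made up.
raw_trips = [
    {"pickup_tract": "A", "dropoff_tract": "B", "fare": 11.30, "tip": 1.60,
     "start": datetime(2018, 11, 1, 8, 52)},
    {"pickup_tract": "", "dropoff_tract": "B", "fare": 6.10, "tip": 0.00,
     "start": datetime(2018, 11, 1, 17, 4)},
]

# Apply the three rounding rules described in the list above.
anonymized = [
    {**trip,
     "fare": round_to_step(trip["fare"], 2.50),
     "tip": round_to_step(trip["tip"], 1.00),
     "start": round_to_quarter_hour(trip["start"])}
    for trip in raw_trips
]

# Keep only trips where both tract fields are present.
usable = [t for t in anonymized if t["pickup_tract"] and t["dropoff_tract"]]
```

Here the second trip is dropped because its pick-up tract is blank, which mirrors how suppressed observations were excluded.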
The census tract is an appropriate geographical scale as it is small enough such that local effects are not obscured.
The rideshare data appear to show some degree of heteroskedasticity when distance is plotted against fare for unshared rides. A likely explanation is that this dataset contains data from different ridesharing companies, each deploying its own fare-calculation algorithm. Interestingly, rides taken on weekends do not appear to be any more expensive than those on weekdays. The distributions are roughly identical. If anything, it appears that unshared rides on weekdays are the most likely to be the most expensive of the four types of rides compared above.
More importantly, I am interested in the geographical distribution of riders in Chicago. Since unshared rides account for 81.9% of rides in Chicago, and pooled rides appear to display very different fare characteristics from non-pooled rides, I will only consider unshared rides in the following analyses.
Grouping trips by pickup and dropoff census tract gives the number of pick-ups and drop-offs in each tract, and the histograms show how these counts are distributed across tracts. Both histograms are strongly right-skewed with an extremely long tail. That is to say, most tracts experienced 50 or fewer pick-ups and drop-offs in the first week of November 2018.
Based on the fivenum summaries below, about 75% of all census tracts had fewer than 50 pick-ups and drop-offs in the first 7 days of November 2018.
## [1] 0 0 8 47 5290
## [1] 0 0 8 48 6081
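For readers unfamiliar with the five-number summary, the sketch below reimplements Tukey's definition (minimum, lower hinge, median, upper hinge, maximum) in plain Python. It is an illustration of how such a summary is computed, run here on a made-up sequence rather than the tract counts.

```python
import math


def fivenum(values):
    """Tukey's five-number summary: min, lower hinge, median, upper hinge, max."""
    x = sorted(values)
    n = len(x)
    if n == 0:
        raise ValueError("fivenum needs at least one value")
    # Depth of the hinges, counted from either end of the sorted data.
    n4 = math.floor((n + 3) / 2) / 2
    depths = [1, n4, (n + 1) / 2, n + 1 - n4, n]
    # A fractional depth means averaging the two neighbouring order statistics.
    return [0.5 * (x[math.floor(d) - 1] + x[math.ceil(d) - 1]) for d in depths]


# Made-up example data, not the tract counts from this analysis.
summary = fivenum([1, 2, 3, 4, 5, 6, 7, 8])
```

The fourth element is the upper hinge, so a value of 47 in the output above means that roughly three quarters of tracts recorded 47 or fewer events.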
In fact, 93% of all pickups and dropoffs occur within a quarter of all census tracts in Chicago. This suggests that there might not be Complete Spatial Randomness (CSR) where ride-share pickups and dropoffs in the City of Chicago are concerned.
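A concentration figure of this kind can be computed by ranking tracts by activity and summing the busiest fraction. The sketch below shows one way to do so; the function name `top_share` and the counts are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def top_share(counts, fraction=0.25):
    """Share of all events that fall in the busiest `fraction` of tracts."""
    ordered = sorted(counts, reverse=True)
    k = max(1, round(len(ordered) * fraction))  # number of tracts in the top group
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Toy counts for eight tracts; the top quarter is the two busiest tracts.
toy_counts = [0, 0, 2, 3, 5, 10, 30, 50]
share = top_share(toy_counts)  # (50 + 30) / 100
```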
When the mean daily number of pickups and dropoffs is plotted, there appears to be no difference between weekdays and weekends in either the geographical distribution or the relative intensity of ride-share activity. Rideshare activity is concentrated in downtown Chicago and in the neighborless census tract that contains Chicago O’Hare International Airport. Hence, in considering the spatial autocorrelation of rideshare activity, there will be no differentiation between weekend and weekday rides.
Two spatial weight matrices were used: a first-order queen contiguity, and k-nearest neighbor (k = 4).
Queen contiguity was selected as it is the more permissive of the two contiguity methods. It takes into account contiguity from all possible directions and border lengths, even if the shared border is just a single vertex. However, census tracts come in different shapes and sizes. Compared to rook contiguity, the distance threshold for a queen contiguity matrix is higher. In other words, distance decay is assumed to be less of an issue. Furthermore, street corners and intersections, upon which census tract boundaries are based, tend to meet at right angles. This increases the odds of two riders diagonally across from each other at an intersection being classed into two separate census tracts. In cases like these, queen contiguity should be used.
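The difference between the two contiguity rules can be illustrated with four square blocks meeting at one intersection. The sketch below is a simplification that counts shared vertices (one or more for queen, two or more for rook) and assumes neighbouring polygons use exactly the same vertex coordinates along a shared border; real tract geometries need proper geometric tests. All names here are hypothetical.

```python
def shared_vertices(poly_a, poly_b):
    """Number of vertices two polygons have in common."""
    return len(set(poly_a) & set(poly_b))


def contiguity(polygons, rule="queen"):
    """Neighbour lists under queen (>= 1 shared vertex) or rook (>= 2) rules."""
    needed = 1 if rule == "queen" else 2
    ids = list(polygons)
    neighbors = {i: [] for i in ids}
    for pos, a in enumerate(ids):
        for b in ids[pos + 1:]:
            if shared_vertices(polygons[a], polygons[b]) >= needed:
                neighbors[a].append(b)
                neighbors[b].append(a)
    return neighbors


def unit_square(col, row):
    """Vertices of a 1x1 block whose lower-left corner is (col, row)."""
    return [(col, row), (col + 1, row), (col + 1, row + 1), (col, row + 1)]


# Four blocks around one intersection at (1, 1).
blocks = {
    "SW": unit_square(0, 0),
    "SE": unit_square(1, 0),
    "NW": unit_square(0, 1),
    "NE": unit_square(1, 1),
}

queen = contiguity(blocks, "queen")  # SW touches SE, NW and the diagonal NE
rook = contiguity(blocks, "rook")    # SW touches only SE and NW
```

Under the rook rule the diagonally opposite blocks are not neighbours, which is the intersection case described above.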
The second weight matrix is the k-nearest neighbor matrix where k = 4, an arbitrary number that should eliminate local noise from too low a k-value while still considering local effects, which might be obscured with a higher k-value. K-nearest neighbors was picked because of the uneven density of census tracts. Census tracts come in different shapes and sizes. It is not inconceivable for a tract to have four contiguous neighbors, one of which is bigger than the other three combined. In such a situation, is it truly statistically precise to argue that points within the large tract experience the effect of distance decay to the same magnitude as those in the smaller, closer tracts? Probably not. Further, there are neighborless tracts, such as the one containing Chicago O’Hare airport to the northwest of all other tracts, that are not adequately captured by contiguity-based spatial weight matrices. Finally, Tobler’s first law of geography states that closer things are more related than farther ones. The concepts of “closer” and “farther” are relative to the size of the census tract. Thus, absolute distance thresholds are less suited to the uneven distribution of census tracts in Chicago; k-nearest neighbors remains one of the better options.
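A k-nearest neighbor structure can be sketched from tract centroids alone. The example below is a minimal illustration using straight-line distance on made-up planar coordinates (k = 2 to keep it small); the tract labels, including the isolated "airport" tract, are hypothetical.

```python
import math


def knn_neighbors(centroids, k=4):
    """For each tract, the k other tracts with the closest centroids."""
    result = {}
    for a, (ax, ay) in centroids.items():
        ranked = sorted(
            (math.hypot(ax - bx, ay - by), b)
            for b, (bx, by) in centroids.items()
            if b != a
        )
        result[a] = [b for _, b in ranked[:k]]
    return result


# Hypothetical centroids: five clustered tracts and one far-away tract.
centroids = {
    "A": (0, 0), "B": (1, 0), "C": (0, 1), "D": (1, 1), "E": (2, 0),
    "airport": (-10, 8),
}

knn = knn_neighbors(centroids, k=2)
```

Two properties are visible here: every tract, however isolated, receives exactly k neighbors, and the relation is not symmetric (the isolated tract lists "C" as a neighbor, but "C" does not list it back).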
A first-order row standardized queen contiguity spatial weights matrix was created. The census tracts file used to generate this matrix includes pick-up and drop-off numbers shown in the histogram above.
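Row standardization divides each tract's neighbor weights by the number of neighbors, so that every non-empty row of the matrix sums to one. A minimal sketch, with a hypothetical neighbor list and with neighborless tracts simply given an empty row (an assumption about how they are treated):

```python
def row_standardize(neighbors):
    """Turn neighbour lists into weights that sum to 1 within each row."""
    weights = {}
    for tract, nbrs in neighbors.items():
        # A tract with no neighbours gets an empty row (all-zero weights).
        weights[tract] = {j: 1.0 / len(nbrs) for j in nbrs} if nbrs else {}
    return weights


# Hypothetical neighbour lists; "D" stands in for a neighbourless tract.
toy_neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
W = row_standardize(toy_neighbors)
```

With these weights, the spatial lag of a tract is simply the average value among its neighbors.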
The Global Moran’s I plot for pick-ups below shows a strong positive relationship between the number of pickups in a census tract and in its neighboring census tracts. This relationship is evident from the strong clustering in quadrants one and three, although the scatter in quadrant three is limited. This suggests that census tracts with a high number of pickups are more likely to be contiguous to other census tracts with a high number of pick-ups as well. In other words, the number of pick-ups in census tracts is spatially autocorrelated. The Moran’s I value is 0.6497119, which is consistent with strong spatial autocorrelation. The results from a Monte Carlo simulation with 499 permutations and a p-value of 0.002 suggest that the outcome of the Moran’s I test is statistically significant and unlikely to be random.
The moran.plot function was not used, for aesthetic reasons.
##
## Monte-Carlo simulation of Moran I
##
## data: Ridercounts.sp$Pickup
## weights: ChicagoWeightsRSM
## number of simulations + 1: 500
##
## statistic = 0.64971, observed rank = 500, p-value = 0.002
## alternative hypothesis: greater
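To make the output above easier to read, the sketch below computes global Moran's I, $I = \frac{n}{S_0}\,\frac{\sum_i\sum_j w_{ij} z_i z_j}{\sum_i z_i^2}$ with $z_i$ the deviation from the mean and $S_0$ the sum of all weights, and a one-sided permutation p-value. It is a plain-Python illustration on a toy line of tracts, not the spdep call that produced the output; `chain_weights` and the values are hypothetical.

```python
import random


def chain_weights(n):
    """Row-standardized weights for n tracts arranged in a line (toy layout)."""
    rows = []
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        rows.append({j: 1.0 / len(nbrs) for j in nbrs})
    return rows


def morans_i(values, weights):
    """Global Moran's I; weights[i] maps neighbour index j to w_ij."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(w for row in weights for w in row.values())
    cross = sum(w * z[i] * z[j]
                for i, row in enumerate(weights) for j, w in row.items())
    return (n / s0) * cross / sum(d * d for d in z)


def moran_permutation_test(values, weights, nsim=499, seed=0):
    """Observed I and the one-sided ("greater") pseudo p-value."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(nsim):
        rng.shuffle(shuffled)  # reassign values to tracts at random
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (nsim + 1)


# Toy data: high values at one end of the line, low values at the other.
toy_values = [7, 6, 4, 1, 0, 0]
observed_i, p_value = moran_permutation_test(toy_values, chain_weights(6))
```

With 499 permutations the smallest attainable p-value is 1/500 = 0.002, which is what an observed rank of 500 corresponds to: the observed statistic exceeded every permuted one.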
Repeating the same for drop-offs, the Moran’s I value is 0.6018465 which is consistent with strong spatial autocorrelation.
##
## Monte-Carlo simulation of Moran I
##
## data: Ridercounts.sp$Dropoff
## weights: ChicagoWeightsRSM
## number of simulations + 1: 500
##
## statistic = 0.60185, observed rank = 500, p-value = 0.002
## alternative hypothesis: greater
Turning to the Local Indicator of Spatial Association (LISA) for pickups and dropoffs in Chicago, computed using GeoDa, the following significance maps can be derived.
Unsurprisingly, for both pickups and dropoffs, census tracts with a high level of ridership in downtown Chicago are co-located with and surrounded by census tracts that report similarly high numbers of pickups and dropoffs. More interestingly, about a quarter of all census tracts report low levels of ridership and are likewise surrounded by tracts with low levels of ridership. These low-low tracts are substantially more significant than the high-high tracts. This suggests that the leverage of the high-high tracts may have skewed the Moran’s I results.
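The high-high and low-low labels come from comparing each tract's deviation from the mean with the average deviation of its neighbors. The sketch below computes the local Moran statistic $I_i = \frac{z_i}{m_2}\sum_j w_{ij} z_j$, with $m_2 = \sum_i z_i^2 / n$, and assigns quadrant labels on a toy line of tracts. It is an illustration only: it does not compute the significance that the maps display, and the helper names and values are hypothetical.

```python
def chain_weights(n):
    """Row-standardized weights for n tracts arranged in a line (toy layout)."""
    rows = []
    for i in range(n):
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n]
        rows.append({j: 1.0 / len(nbrs) for j in nbrs})
    return rows


def local_moran(values, weights):
    """Deviations, spatial lags of the deviations, and local Moran's I_i."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    lags = [sum(w * z[j] for j, w in row.items()) for row in weights]
    local_i = [z[i] * lags[i] / m2 for i in range(n)]
    return z, lags, local_i


def quadrant(z_i, lag_i):
    """Moran scatterplot quadrant of one tract."""
    if z_i > 0 and lag_i > 0:
        return "high-high"
    if z_i < 0 and lag_i < 0:
        return "low-low"
    if z_i > 0 and lag_i < 0:
        return "high-low"
    if z_i < 0 and lag_i > 0:
        return "low-high"
    return "undefined"  # tract or its neighbours sit exactly at the mean


toy_values = [7, 6, 4, 1, 0, 0]
z, lags, local_i = local_moran(toy_values, chain_weights(6))
labels = [quadrant(a, b) for a, b in zip(z, lags)]
```

With row-standardized weights the mean of the local values equals the global Moran's I, which is why a handful of very large high-high values can pull the global statistic upward.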
By selecting the four nearest neighbors of each census tract, a connectivity map of Chicago’s census tracts can be generated, as seen below. In this instance, even non-contiguous exclaves such as the census tract that Chicago O’Hare International Airport sits on are connected with other tracts. The global Moran’s I value for pickups from the k-nearest neighbor analysis is 0.56184. This is weaker than the value returned by the contiguity matrix. However, the results remain statistically significant and strongly suggest that autocorrelation is present.
##
## Monte-Carlo simulation of Moran I
##
## data: Ridercounts$Pickup
## weights: ChicagoKNNRSM
## number of simulations + 1: 500
##
## statistic = 0.56184, observed rank = 500, p-value = 0.002
## alternative hypothesis: greater
Similarly, the geographical concentration of dropoffs is more likely than not to be autocorrelated, with a Moran’s I value of 0.51153. These results are statistically significant, as indicated by the Monte Carlo simulation (p-value = 0.002).
##
## Monte-Carlo simulation of Moran I
##
## data: Ridercounts.sp$Dropoff
## weights: ChicagoKNNRSM
## number of simulations + 1: 500
##
## statistic = 0.51153, observed rank = 500, p-value = 0.002
## alternative hypothesis: greater
The two LISA cluster maps are similar to the LISA cluster maps drawn up based on a queen contiguity spatial weights matrix. Thus, the data lend themselves to the conclusion that rideshares cluster in downtown Chicago and suburbs to the south, regardless of day of week. The intense clustering of rideshare activity has more likely than not leveraged the Moran’s I plot.
It is not entirely clear from the ridesharing dataset itself what is causing the clustering. Intuitively, it could be due to the concentration of economic activity and footfall within downtown Chicago. A more in-depth study can be performed with the relevant economic data.
Intriguingly, GeoDa and the localmoran function returned different results as to the significance of the observations of statistical clustering even though all the parameters selected and input were identical.
These analyses further point towards how some areas rely on ridesharing companies, such as Uber and Lyft, a lot more than others. The question is why. Does this mean that rideshares are in competitive supply with other forms of private transportation and with public transit? Further, the high number of unshared rides and the intense clustering of rideshare activity appear, at first glance, to support the argument that ridesharing companies have contributed to congestion instead of alleviating it, as was their premise when they were first launched. Expanding on these analyses, a larger dataset covering a longer period will provide more accurate results. Further, more data can also enable the reverse engineering of Uber and Lyft’s fare algorithms.